- Probability rules
- Discrete random variables
- Continuous random variables
- Statistical inference
9/1/2020
Probability and statistics let us talk efficiently about things we are unsure about.
All of these involve inferring or predicting unknown quantities!
Random variables are numbers that we are not sure about, but whose potential outcomes we may have some idea of how to describe. They represent uncertain events.
If these numbers can take values in a finite or countably infinite set, the random variable is said to be discrete, otherwise it is called continuous.
Example: Suppose we are about to toss two coins. Let \(X\) denote the number of heads. We say that \(X\) is the random variable that stands for the number we are not sure about.
Probability is a language designed to help us talk and think about aggregate properties of random variables.
The key idea is that to each event we will assign a number between 0 and 1 which reflects how likely that event is to occur.
Basic axioms:

1. For any event \(A\), \(0 \leq P(A) \leq 1\).
2. The probability that *something* happens is \(1\).
3. If \(A\) and \(B\) cannot happen together, then \(P(A \text{ or } B) = P(A) + P(B)\).
4. Complement: \(P(\text{not } A) = 1 - P(A)\).
5. Multiplication rule: \(P(A \text{ and } B) = P(A) \cdot P(B \mid A)\).
Probability distributions describe the behavior of random variables.
Example: \(X\) is the random variable denoting the number of heads in two independent coin tosses.
\(X\) is discrete as we are able to list all the possible outcomes, i.e. \(X \in \{0, 1, 2\}\).
We can describe its behavior through the following probability distribution: \[p(x) = P(X = x) = \begin{cases} 0.25 & \text{if } x = 0 \\ 0.5 & \text{if } x = 1 \\ 0.25 & \text{if } x = 2 \end{cases}\]
Question: What is \(P(X = 0 \text{ and } X = 2)\)? How about \(P(X \geq 1)\)?
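A quick empirical check, as a minimal Python sketch (numpy assumed, fair coin assumed):

```python
# Simulate many rounds of two fair coin tosses and estimate the pmf of X.
import numpy as np

rng = np.random.default_rng(0)
tosses = rng.integers(0, 2, size=(100_000, 2))  # 0 = tails, 1 = heads
x = tosses.sum(axis=1)                          # X = number of heads

for k in (0, 1, 2):
    print(f"P(X = {k}) ~ {np.mean(x == k):.3f}")  # ~0.25, ~0.50, ~0.25
print(f"P(X >= 1) ~ {np.mean(x >= 1):.3f}")       # ~0.75
# P(X = 0 and X = 2) = 0: the two events are mutually exclusive.
```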
In general we want to use probability to address problems involving more than one variable at a time.
Example: returns of a financial portfolio. If we know that the economy will be growing next year, does that change the assessment about the behavior of my returns?
We need to be able to describe what we think will happen to one variable relative to another!
We need the concepts of conditional, joint, and marginal distributions.
Example: How are the returns \(S\) of a portfolio impacted by the overall economy?
Let \(E\) denote the performance of the economy next quarter. For simplicity, say \(E = 1\) if the economy is expanding and \(E = 0\) if the economy is contracting. Let’s assume \(P(E = 1) = 0.7\).
A conditional probability is the chance that one thing happens, given that some other thing has already happened.
| \(S\) | \(P(S \mid E = 1)\) | \(P(S \mid E = 0)\) |
|---|---|---|
| 1 | 0.05 | 0.20 |
| 2 | 0.20 | 0.30 |
| 3 | 0.50 | 0.30 |
| 4 | 0.25 | 0.20 |
The probability of \(S = 4\) given that the economy is growing is \(0.25\).
The conditional distributions tell us what can happen to \(S\) for a given value of \(E\). But what about \(S\) and \(E\) jointly?
\[\begin{split} P(S = 4 \text{ and } E = 1) &= P(E = 1) \cdot P(S =4 \mid E = 1) \\ &= 0.70 \cdot 0.25 = 0.175 \end{split}\]
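The same computation for every cell, as a small Python sketch (values taken from the tables above):

```python
# Build the full joint distribution P(S, E) = P(E) * P(S | E).
p_e = {1: 0.7, 0: 0.3}                         # marginal of the economy
p_s_given_e = {
    1: {1: 0.05, 2: 0.20, 3: 0.50, 4: 0.25},   # P(S = s | E = 1)
    0: {1: 0.20, 2: 0.30, 3: 0.30, 4: 0.20},   # P(S = s | E = 0)
}

joint = {(s, e): p_e[e] * p_s_given_e[e][s]
         for e in p_e for s in p_s_given_e[e]}

print(joint[(4, 1)])                    # 0.7 * 0.25 = 0.175
print(round(sum(joint.values()), 10))   # sanity check: all cells sum to 1.0
```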
We just saw how to calculate the joint distribution starting from a marginal and a conditional distribution.
If we have a joint distribution, we know everything about the two random variables: both their marginals and their dependence structure.
Two random variables \(X\) and \(Y\) are independent if \[P(Y = y \mid X = x) = P(Y = y) \quad \forall x, y.\]
In other words, knowing \(X\) tells you nothing about \(Y\)!
Remember rule (5)?
\[\begin{split} P(Y = y \text{ and } X = x) &= P(Y = y \mid X = x) P(X = x) \\ &= P(Y = y) P(X = x) \end{split}\]
Example: tossing a coin 2 times. What is the probability of getting \(H\) in the second toss given that we saw a \(T\) in the first one?
Remember rule (5)? \[P(A \text{ and } B) = P(A) P(B \mid A) = P(B) P(A \mid B)\]
This means that \[P(B \mid A) = \frac{P(B)P(A \mid B)}{P(A)}\] This is known as Bayes’ Rule, and it is used in many real-world applications.
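For instance, applied to the portfolio example above: having observed a high return \(S = 4\), how should we update our belief that the economy is expanding? A minimal Python sketch (values from the earlier tables):

```python
# Bayes' rule: P(E = 1 | S = 4) = P(E = 1) * P(S = 4 | E = 1) / P(S = 4).
p_e1 = 0.7             # prior probability of expansion
p_s4_given_e1 = 0.25   # from the conditional table
p_s4_given_e0 = 0.20

p_s4 = p_e1 * p_s4_given_e1 + (1 - p_e1) * p_s4_given_e0  # marginal P(S = 4)
p_e1_given_s4 = p_e1 * p_s4_given_e1 / p_s4
print(round(p_e1_given_s4, 3))  # 0.745: the high return raises our belief in expansion
```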
The mean or expected value is defined as (for a discrete \(X\)): \[E(X) = \sum_{x \in D} x \cdot P(X = x)\]
We weight each possible value by how likely it is. This provides us with a measure of centrality of the distribution, i.e. a “good” prediction for \(X\)!
Question: what is the mean number of heads in two independent coin tosses? Remember that \[p(x) = P(X = x) = \begin{cases} 0.25 & \text{if } x = 0 \\ 0.5 & \text{if } x = 1 \\ 0.25 & \text{if } x = 2 \end{cases}\]
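A one-line check in Python:

```python
# E(X) = sum over x of x * P(X = x), using the pmf above.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
print(sum(x * p for x, p in pmf.items()))  # 0*0.25 + 1*0.5 + 2*0.25 = 1.0
```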
Example: an investment has three possible revenue outcomes:

| Revenue | \(P(\text{Revenue})\) |
|---|---|
| 250,000$ | 0.7 |
| 0$ | 0.138 |
| 25,000,000$ | 0.162 |
The expected revenue is \[ E[\text{Revenue}] = 0.7 \cdot 250,000\$ + 0.138 \cdot 0\$ + 0.162 \cdot 25,000,000\$ = 4,225,000\$ \]
Should we invest or not?
Investment 1:

| Revenue | \(P(\text{Revenue})\) |
|---|---|
| 250,000$ | 0.7 |
| 0$ | 0.138 |
| 25,000,000$ | 0.162 |

Investment 2:

| Revenue | \(P(\text{Revenue})\) |
|---|---|
| 3,721,428$ | 0.7 |
| 0$ | 0.138 |
| 10,000,000$ | 0.162 |
We add a second investment: the expected revenue is still \(4,225,000\$\). What is the difference?
The variance is defined as (for a discrete \(X\)): \[\text{Var}(X) = \sum_{x \in D} [x - E(X)]^{2} \cdot P(X = x) = \sum_{x \in D} x^{2} \cdot P(X = x) - (E[X])^{2}\]
Weighted average of squared prediction errors. This is a measure of the spread of a distribution. More “risky” distributions have larger variance.
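Returning to the two investments above, a short Python sketch (revenue figures from the tables) shows that the means agree but the spreads do not:

```python
# Compare mean and standard deviation of the two revenue distributions.
def mean_var(dist):
    m = sum(x * p for x, p in dist.items())
    v = sum(x**2 * p for x, p in dist.items()) - m**2
    return m, v

inv1 = {250_000: 0.7, 0: 0.138, 25_000_000: 0.162}
inv2 = {3_721_428: 0.7, 0: 0.138, 10_000_000: 0.162}

for name, dist in (("investment 1", inv1), ("investment 2", inv2)):
    m, v = mean_var(dist)
    print(f"{name}: mean = {m:,.0f}$, std dev = {v**0.5:,.0f}$")
# Same expected revenue (~4,225,000$), but investment 1 is far more spread out.
```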
Question: what is the variance for the number of heads in two independent coin tosses? Remember that \[p(x) = P(X = x) = \begin{cases} 0.25 & \text{if } x = 0 \\ 0.5 & \text{if } x = 1 \\ 0.25 & \text{if } x = 2 \end{cases}\]
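And the same computation for the coin-toss distribution:

```python
# Var(X) = E(X^2) - (E(X))^2 for the number of heads in two tosses.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
mean = sum(x * p for x, p in pmf.items())             # 1.0
var = sum(x**2 * p for x, p in pmf.items()) - mean**2
print(var)  # 1.5 - 1.0 = 0.5
```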
The normal distribution is the most widely used probability distribution for describing a continuous random variable.
The area under its probability density function (p.d.f.) gives \[P(X \in [a, b]) = \int_{a}^{b} f(x) \ dx\]
When we say “the normal distribution”, we really mean a family of distributions.
We obtain probability densities in the normal family by shifting the bell curve around and spreading it out (or tightening it up).
\(X \sim \text{N}(\mu, \sigma^{2})\): “normal distribution with mean \(\mu\) and variance \(\sigma^{2}\)”.
The standard normal distribution has mean \(0\) and variance \(1\); a standard normal random variable is usually denoted by \(Z\).
If \(Z \sim \text{N}(0,1)\), then \[\begin{split} &P(-1 < Z < 1) \approx 0.68 \\ &P(-1.96 < Z < 1.96) \approx 0.95 \end{split}\]
For simplicity we will often use \(P(-2 < Z < 2) \approx 0.95\).
In general, \(P( \mu - 2 \sigma < X < \mu + 2 \sigma) \approx 0.95\).
Example: below are the pdfs of \(X_{1} \sim \text{N}(0, 1)\), \(X_{2} \sim \text{N}(3, 1)\), and \(X_{3} \sim \text{N}(0, 16)\). Which pdf goes with which \(X\)?
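The original figure is not reproduced here, but a short sketch (matplotlib and scipy assumed) regenerates it:

```python
# Plot the three normal pdfs so they can be matched to X1, X2, X3.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-12, 12, 500)
for mu, var, label in [(0, 1, "N(0, 1)"), (3, 1, "N(3, 1)"), (0, 16, "N(0, 16)")]:
    plt.plot(x, norm.pdf(x, loc=mu, scale=np.sqrt(var)), label=label)
plt.legend()
plt.show()
# Changing mu shifts the bell curve; a larger variance spreads and flattens it.
```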
In \(X \sim \text{N}(\mu, \sigma^{2})\), \(\mu\) is the mean and \(\sigma^{2}\) is the variance.
Standardization: if \(X \sim \text{N}(\mu, \sigma^{2})\), then \[Z = \frac{X - \mu}{\sigma} \sim \text{N}(0, 1)\]
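A quick numerical check of standardization (scipy assumed; the \(\text{N}(3, 4)\) example is made up for illustration):

```python
# Probabilities on the original scale match those after standardizing.
from scipy.stats import norm

mu, sigma = 3.0, 2.0                       # X ~ N(3, 4), so sigma = 2
x = 5.0
print(norm.cdf(x, loc=mu, scale=sigma))    # P(X <= 5) directly
print(norm.cdf((x - mu) / sigma))          # same number via Z = (X - mu)/sigma
print(norm.cdf(1.96) - norm.cdf(-1.96))    # ~0.95, as quoted above
```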
Summary: \(X \sim \text{N}(\mu, \sigma^{2})\):

- \(E[X] = \mu\) and \(\text{Var}(X) = \sigma^{2}\);
- \(P(\mu - 2\sigma < X < \mu + 2\sigma) \approx 0.95\);
- standardizing, \(Z = (X - \mu)/\sigma \sim \text{N}(0, 1)\).
Two random variables can also be jointly normal; this is the bivariate normal distribution: \[(X, Y) \sim \text{N}_{2}(\boldsymbol{\mu}, \Sigma),\] where \(\boldsymbol{\mu}\) is the mean vector and \(\Sigma\) is the covariance matrix.
Probabilities are now double integrals of the joint density: \[P(X \in A, Y \in B) = \int_{A} \int_{B} f(x, y)\ dy\, dx\]
The marginal distributions of a bivariate normal distribution are normal distributions themselves!
How to compute a marginal from the joint?
\[f_{X}(x) = \int_{\mathcal{Y}} f_{(X, Y)}(x, y) \ dy,\] where \(\mathcal{Y}\) is the set of values \(Y\) can take when \(X = x\).
Check out this example.
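Separately, here is a numerical sketch (scipy assumed; the mean vector and covariance matrix are hypothetical) confirming that integrating a bivariate normal joint density over \(y\) recovers the normal marginal of \(X\):

```python
# Marginalize a bivariate normal numerically and compare with the known marginal.
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.integrate import quad

mu = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
joint = multivariate_normal(mean=mu, cov=cov)

x0 = 0.5
f_x0, _ = quad(lambda y: joint.pdf([x0, y]), -np.inf, np.inf)  # integrate over y
print(f_x0)                                                    # numerical marginal at x0
print(norm.pdf(x0, loc=mu[0], scale=np.sqrt(cov[0, 0])))       # exact N(1, 1) pdf at x0
```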
\[\begin{split} &E[X] = \int x \cdot f(x) \ dx \\ &\text{Var}[X] = \int x^{2} \cdot f(x) \ dx - (E[X])^{2} \end{split}\]
You will practice all these calculations in the homework!
Correlation measures the linear relationship between two random variables. The covariance is \[\text{Cov}(X, Y) = \iint x y \cdot f(x, y) \ dx\, dy - E[X] E[Y],\] and the correlation rescales it to lie in \([-1, 1]\): \[\rho = \text{Cor}(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}.\] More on this when we talk about linear regression.
If two variables are independent, then \(\rho = 0\).
The converse is NOT true.
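A classic counterexample: take \(X \sim \text{N}(0, 1)\) and \(Y = X^{2}\). Then \(Y\) is a deterministic function of \(X\), yet their correlation is zero. A simulation sketch (numpy assumed):

```python
# Zero correlation does not imply independence.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2                          # completely determined by x

print(np.corrcoef(x, y)[0, 1])    # ~0: no *linear* relationship
# Yet knowing x pins down y exactly, so X and Y are clearly dependent.
```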
Supervised learning: study of how \(Y\) varies with \(X\)!
Until now, we have assumed that we know the probabilities associated with random events.
In practice we do not know them, but we have data. The role of a statistician is to assume a statistical model and estimate its parameters from the data.
One way of doing this is maximum likelihood estimation. More on this in the first homework!
We are not only interested in point estimates of a model's parameters; we also want to quantify the uncertainty around them: confidence intervals!
We assume that the data are independent and identically distributed draws from a normal distribution. By independence, the likelihood factorizes: \[L_{\theta}(x_{1}, \dots, x_{n}) = p(x_{1} \mid \theta) \cdot \dots \cdot p(x_{n} \mid \theta)\]
Steps:

1. Write down the likelihood of the observed data under the model.
2. Maximize it over \(\theta\) to get the maximum likelihood estimate (for the normal mean, \(\hat{\mu} = \bar{x}\)).
3. Use the sampling distribution of the estimate to build a confidence interval, e.g. \(\bar{x} \pm 1.96 \cdot s / \sqrt{n}\) for a \(95\%\) CI.
If we were to repeat the procedure \(100\) times, \(\mu\) would be in the CI approximately 95 times out of 100 (in the case \(\alpha = 5\%\)).
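A simulation sketch of this interpretation (numpy assumed; \(\mu\), \(\sigma\), and the sample size are hypothetical):

```python
# Repeatedly draw normal samples, build a 95% CI each time, count the hits.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 50, 10_000
hits = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, size=n)
    xbar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    hits += xbar - 1.96 * se < mu < xbar + 1.96 * se
print(hits / reps)  # ~0.95: the true mean is covered about 95% of the time
```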